Problem

We need data

  • APIs of Social Network Sites are being closed (Facebook, Twitter, Reddit)
  • Web scraping is becoming more difficult
  • Post-API Age (Freelon 2018)

Solution

Sharing collected data

Data Sharing

Pros

  1. Adds credibility to research (reproducibility)
  2. Boosts citation numbers
  3. Promotes research

Cons

  1. unethical
  2. illegal
  3. jail?

Avenues for Sharing Data

  1. Share the entire dataset

But: copyright, privacy, and other concerns

  2. Make metadata available for rehydration of the data

But: original data might disappear

  3. Make pre-processed versions of the data available

But: preprocessing techniques are constantly improved

  4. Make nonconsumptive research capabilities available
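
The rehydration avenue, and its weakness, can be sketched roughly as follows. This is a minimal illustration only: `PLATFORM`, `fetch_post`, `dehydrate`, and `rehydrate` are hypothetical names invented for the example, and the in-memory dictionary stands in for whatever platform API or archive would be queried in practice.

```python
from typing import Optional

# Hypothetical stand-in for the live platform; post "p2" has since been deleted.
PLATFORM = {"p1": "I like pizza", "p3": "I love data"}

def dehydrate(posts: dict[str, str]) -> list[str]:
    """Keep only the identifiers, which are the part that gets shared."""
    return sorted(posts)

def fetch_post(post_id: str) -> Optional[str]:
    """Hypothetical lookup; returns None if the post no longer exists."""
    return PLATFORM.get(post_id)

def rehydrate(ids, fetch=fetch_post):
    """Re-collect content for shared IDs and report the IDs that are gone."""
    found, missing = {}, []
    for post_id in ids:
        text = fetch(post_id)
        if text is None:
            missing.append(post_id)
        else:
            found[post_id] = text
    return found, missing

# The original collector had three posts but shares only their IDs.
original = {"p1": "I like pizza", "p2": "I love pizza", "p3": "I love data"}
shared_ids = dehydrate(original)

# A second researcher rehydrates later and cannot recover the deleted post.
recovered, lost = rehydrate(shared_ids)
```

Here `lost` contains the ID of the deleted post, which is the "original data might disappear" problem in miniature.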

What is Nonconsumptive Research?

  • term from biology/ecology
  • study or analyze materials without destroying, altering, or “consuming” them
  • like photographing wildlife behavior rather than collecting specimens

From Duneier (2017)

👉 Analysing the data without ‘consuming’ (i.e., reading/listening/watching) it

Use cases

  1. Make selected computational analyses available on existing (inaccessible) datasets
  2. Promote inaccessible datasets to draw users to the Secure Data Centre
  3. Select a relevant subset for rehydration
  4. Provide tools to perform parts of the analysis pipeline until derivatives are safe to share

What do we need to offer Nonconsumptive Research?

Software/Infrastructure to…

  1. enable analyses
  2. limit access to raw data
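
The two requirements above can be illustrated with a small sketch. It is not how AmCAT or any particular system works: `NonconsumptiveCorpus`, its methods, and the `min_docs` suppression threshold are hypothetical names chosen for the example. The leading underscore in Python is only a convention, so a real deployment would need to enforce the boundary on a server rather than inside one process.

```python
from collections import Counter

class NonconsumptiveCorpus:
    """Holds raw texts privately and answers only aggregate queries."""

    def __init__(self, texts: dict[str, str], min_docs: int = 2):
        self._texts = texts        # raw data, never returned to the caller
        self._min_docs = min_docs  # hypothetical disclosure threshold

    def count_documents(self, term: str) -> int:
        """Number of documents containing the term (case-insensitive)."""
        term = term.lower()
        return sum(term in text.lower().split() for text in self._texts.values())

    def document_frequencies(self) -> dict[str, int]:
        """Document frequency per term, suppressing terms rarer than min_docs."""
        df = Counter()
        for text in self._texts.values():
            df.update(set(text.lower().split()))
        return {term: n for term, n in df.items() if n >= self._min_docs}

corpus = NonconsumptiveCorpus(
    {"text1": "I like pizza", "text2": "I love pizza", "text3": "I love data"}
)
n_pizza = corpus.count_documents("pizza")
frequencies = corpus.document_frequencies()
```

The caller can run analyses (counts, frequencies) without ever receiving a document, and rare terms that might identify a single text are withheld.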

Example implementation by van Atteveldt, W., Welbers, K., and Gruber, J. B.:

AmCAT 4: Demo

Steps in Supervised Machine Learning

flowchart LR
    A[Raw Data] --> Train(Training Data)
    A[Raw Data] --> Test(Test Data)
    A[Raw Data] --> Unseen(Unseen Data)

    Train -->|Pre-Process| Train2(Training Vectors)
    Test -->|Pre-Process| Test2(Test Vectors)
    Unseen -->|Pre-Process| Unseen2(Unseen Vectors)

    Train2 -->|Train| Model(Model)
    Test2 -->|Validate| Model
    Unseen2 -->|Employ| Model
    
    Model -->|Analysis| Insights
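
The steps in the flowchart can be walked through with a toy example. This is a sketch under stated assumptions: the texts and labels are invented, and the classifier is a deliberately simple token-overlap model chosen so the example runs with the standard library only. It is not a method proposed in the talk.

```python
from collections import Counter

def preprocess(text: str) -> Counter:
    """Pre-process: lowercase, split on whitespace, count tokens."""
    return Counter(text.lower().split())

def train(vectors, labels):
    """Train: sum the token counts per label into one profile per class."""
    model = {}
    for vec, label in zip(vectors, labels):
        model.setdefault(label, Counter()).update(vec)
    return model

def predict(model, vec):
    """Employ: pick the label whose profile overlaps most with the vector."""
    scores = {
        label: sum(vec[tok] * profile[tok] for tok in vec)
        for label, profile in model.items()
    }
    return max(scores, key=scores.get)

# Raw data split into training, test, and unseen parts (all invented).
train_texts = ["i love pizza", "i love data", "i hate rain", "i hate traffic"]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts, test_labels = ["i love cake", "i hate mondays"], ["pos", "neg"]
unseen_texts = ["we love data", "they hate rain"]

# Pre-process each part into vectors.
train_vectors = [preprocess(t) for t in train_texts]
test_vectors = [preprocess(t) for t in test_texts]
unseen_vectors = [preprocess(t) for t in unseen_texts]

# Train, validate, employ.
model = train(train_vectors, train_labels)
accuracy = sum(
    predict(model, v) == y for v, y in zip(test_vectors, test_labels)
) / len(test_labels)
predictions = [predict(model, v) for v in unseen_vectors]
```

Note that `train` and `predict` only ever see vectors, never the raw strings, so the pre-processing step is the natural place to draw a boundary around protected data.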

Steps in Supervised Machine Learning

flowchart LR
    A[Their Raw Data] --> Train(Training Data)
    A[Their Raw Data] --> Test(Test Data)
    B[Our Raw Data] --> Unseen(Unseen Data)

    Train -->|Pre-Process| Train2(Training Vectors)
    Test -->|Pre-Process| Test2(Test Vectors)
    Unseen(Unseen Data) -->|**Pre-Process**| Unseen2(Unseen Vectors)

    Train2 -->|Train| Model(Model)
    Test2 -->|Validate| Model
    Unseen2 -->|Employ| Model
    
    Model -->|Analysis| Insights

    %% Styling
    classDef serverStyle fill:#FFD700,stroke:#333,stroke-width:2px,font-weight:bold
    class B serverStyle
    class Unseen serverStyle
    class Unseen2 serverStyle
    linkStyle 2 stroke:#FFD700,stroke-width:3px
    linkStyle 5 stroke:#FFD700,stroke-width:3px

Destructive Preprocessing: Examples

Document Feature Matrix:

id     i  like  pizza  love  data
text1  1  1     1      0     0
text2  1  0     1      1     0
text3  1  0     0      1     1
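
A document feature matrix like the one above could be produced along these lines. The three input texts are an assumption: they do not appear in the slides and were chosen only because they are consistent with the counts shown. `document_feature_matrix` is a hypothetical helper, not a function from any particular package.

```python
def document_feature_matrix(texts: dict[str, str]):
    """Return (features, rows): features in order of first appearance,
    rows mapping each document id to its list of token counts."""
    features = []
    tokenised = {}
    for doc_id, text in texts.items():
        tokens = text.lower().split()
        tokenised[doc_id] = tokens
        for tok in tokens:
            if tok not in features:
                features.append(tok)
    rows = {
        doc_id: [tokens.count(f) for f in features]
        for doc_id, tokens in tokenised.items()
    }
    return features, rows

# Assumed texts, consistent with the matrix above.
texts = {"text1": "I like pizza", "text2": "I love pizza", "text3": "I love data"}
features, dfm = document_feature_matrix(texts)
```

The matrix stores only how often each token occurs per document; word order is not stored.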

Llama 3 Embeddings:

id  dim_1  dim_2  dim_3  ...  dim_4096
1   0.2    -2     3      ...  -3
2   0.3    -2     3      ...  -4
3   -2.1   -2     6      ...  -2

Original can’t be reconstructed 👉 safe to share

Coming soon…

Thank you for your attention!

paper

References

Duneier, Mitchell. 2017. Ghetto: The Invention of a Place, the History of an Idea. First paperback edition. New York: Farrar, Straus and Giroux.
Freelon, Deen. 2018. “Computational Research in the Post-API Age.” Political Communication 35 (4): 665–68. https://doi.org/10.1080/10584609.2018.1477506.